Text Mining - Sentiment Analysis - Macbeth

(adapted from Text Mining with R: a Tidy Approach by J. Silge and D. Robinson)

In this notebook, you will analyze the emotional content in Shakespeare's Macbeth.

SECTIONS

  1. Introduction
  2. Sentiment Analysis Workflow
  3. Sentiment Lexicons
  4. Preparing the Data
  5. Sentiment Analysis of Macbeth
  6. Exercises


INTRODUCTION

Most humans have a good native understanding of the emotional intent of words, which leads us to infer surprise, disgust, joy, pain, and so forth, from a text segment.

The process, when applied by machines, is called sentiment analysis or opinion mining.

There are numerous challenges, as one can imagine, namely:

  1. readers don't always agree on the emotional content of a text
  2. words may have different meanings/emotional values depending on the context (auto-antonyms)
  3. qualifiers can drastically change a term's emotional value

Sentiment analysis, as presented here, is lexicon-based: it requires dictionaries of emotional content to have been compiled ahead of time.
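
The lexicon-based idea can be sketched in a few lines of base R. This is a minimal illustration with a made-up toy dictionary (the real lexicons are introduced below): each word is looked up in a pre-compiled table and the values are summed, with unknown words treated as neutral.

```r
# toy AFINN-style dictionary (hypothetical values, for illustration only)
toy_lexicon <- c(joy = 3, good = 2, pain = -2, awful = -3)

score_text <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]  # crude tokenization
  sum(toy_lexicon[words], na.rm = TRUE)              # unmatched words count as 0
}

score_text("Good news brings joy")   # 2 + 3 = 5
score_text("An awful day of pain")   # -3 - 2 = -5
```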



SENTIMENT ANALYSIS WORKFLOW

In general, we can analyze the sentiment using the word-by-word method as follows:

  1. start with Text Data
  2. un-nest the tokens to produce the first iteration of Tidy Text
  3. clean the Tidy Text as required
  4. join the Tidy Text to an appropriate Sentiment Lexicon
  5. summarize the Tidy Text/Sentiment Lexicon into a first iteration of Summarized Text
  6. clean and analyze the Summarized Text
  7. visualize and present the Text Mining results
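
The steps above can be sketched end-to-end on a couple of Macbeth lines (a sketch assuming the tidytext and dplyr packages are installed; the Bing lexicon ships with tidytext):

```r
library(tidytext)
library(dplyr)

# step 1: text data
text_df <- tibble(line = 1:2,
                  text = c("What bloody man is that?",
                           "So foul and fair a day I have not seen."))

sentiment_counts <- text_df %>%
  unnest_tokens(word, text) %>%                        # step 2: tidy text
  anti_join(stop_words, by = "word") %>%               # step 3: clean
  inner_join(get_sentiments("bing"), by = "word") %>%  # step 4: join lexicon
  count(sentiment)                                     # step 5: summarize

sentiment_counts
```

Steps 6 and 7 (further analysis and visualization) are carried out on Macbeth later in this notebook.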


SENTIMENT LEXICONS

We will use the sentiment lexicons that are included with the tidytext package: AFINN, Bing, NRC, and Loughran.

In [1]:
library(tidytext)

head(sentiments) # get all the lexicons as one tibble

table(sentiments$lexicon) # see what lexicons are available

AFINN = get_sentiments("afinn") # words on a scale from -5 (negative) to 5 (positive)
BING = get_sentiments("bing") # binary negative/positive
NRC = get_sentiments("nrc") # assigns categories of sentiments (possibly more than one per term)
LOUGHRAN = get_sentiments("loughran")
Out[1]:
word      sentiment lexicon score
abacus    trust     nrc     NA
abandon   fear      nrc     NA
abandon   negative  nrc     NA
abandon   sadness   nrc     NA
abandoned anger     nrc     NA
abandoned fear      nrc     NA
Out[1]:
   AFINN     bing loughran      nrc 
    2476     6788     4149    13901 

Let's take a quick look at the 4 lexicons (they do not all contain the same number of observations).

In [2]:
AFINN = get_sentiments("afinn") # words on a scale from -5 (negative) to 5 (positive)
BING = get_sentiments("bing") # binary negative/positive
NRC = get_sentiments("nrc") # assigns categories of sentiments (possibly more than one per term)
LOUGHRAN = get_sentiments("loughran")

str(AFINN)
str(BING)
str(NRC)
str(LOUGHRAN)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	2476 obs. of  2 variables:
 $ word : chr  "abandon" "abandoned" "abandons" "abducted" ...
 $ score: int  -2 -2 -2 -2 -2 -2 -3 -3 -3 -3 ...
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	6788 obs. of  2 variables:
 $ word     : chr  "2-faced" "2-faces" "a+" "abnormal" ...
 $ sentiment: chr  "negative" "negative" "positive" "negative" ...
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	13901 obs. of  2 variables:
 $ word     : chr  "abacus" "abandon" "abandon" "abandon" ...
 $ sentiment: chr  "trust" "fear" "negative" "sadness" ...
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	4149 obs. of  2 variables:
 $ word     : chr  "abandon" "abandoned" "abandoning" "abandonment" ...
 $ sentiment: chr  "negative" "negative" "negative" "negative" ...

The sentiment categories (and distributions) can be accessed using basic R commands.

In [3]:
table(AFINN$score)
table(BING$sentiment)
table(NRC$sentiment)
table(LOUGHRAN$sentiment)
Out[3]:
 -5  -4  -3  -2  -1   0   1   2   3   4   5 
 16  43 264 965 309   1 208 448 172  45   5 
Out[3]:
negative positive 
    4782     2006 
Out[3]:
       anger anticipation      disgust         fear          joy     negative 
        1247          839         1058         1476          689         3324 
    positive      sadness     surprise        trust 
        2312         1191          534         1231 
Out[3]:
constraining    litigious     negative     positive  superfluous  uncertainty 
         184          903         2355          354           56          297 

At first glance, it seems that there are more terms on the negative end of the spectrum. What kind of an effect do you think that could have?
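
We can quantify that imbalance as the share of negative terms in each lexicon (counts copied from the table() outputs above; a lexicon that skews negative will tend to pull net scores down for any text):

```r
# negative/positive counts copied from the table() outputs above
bing_counts <- c(negative = 4782, positive = 2006)
nrc_counts  <- c(negative = 3324, positive = 2312)

unname(round(bing_counts["negative"] / sum(bing_counts), 2))  # 0.7
unname(round(nrc_counts["negative"] / sum(nrc_counts), 2))    # 0.59
```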

How do the various lexicons grade specific words? Let's take a look at a few possibilities:

In [4]:
word = "abandon"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
Out[4]:
word    score
abandon -2
Out[4]:
word sentiment
Out[4]:
word    sentiment
abandon fear
abandon negative
abandon sadness
Out[4]:
word    sentiment
abandon negative
In [5]:
word = "bad"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
Out[5]:
word score
bad  -3
Out[5]:
word sentiment
bad  negative
Out[5]:
word sentiment
bad  anger
bad  disgust
bad  fear
bad  negative
bad  sadness
Out[5]:
word sentiment
bad  negative
In [6]:
word = "not"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
Out[6]:
word score
Out[6]:
word sentiment
Out[6]:
word sentiment
Out[6]:
word sentiment
In [7]:
word = "cool"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
Out[7]:
word score
cool 1
Out[7]:
word sentiment
cool positive
Out[7]:
word sentiment
cool positive
Out[7]:
word sentiment
In [8]:
word = "egregious"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
Out[8]:
word score
Out[8]:
word      sentiment
egregious negative
Out[8]:
word      sentiment
egregious anger
egregious disgust
egregious negative
Out[8]:
word      sentiment
egregious negative
In [9]:
word = "strike"
AFINN[AFINN$word == word,]
BING[BING$word == word,]
NRC[NRC$word == word,]
LOUGHRAN[LOUGHRAN$word == word,]
Out[9]:
word   score
strike -1
Out[9]:
word   sentiment
strike negative
Out[9]:
word   sentiment
strike anger
strike negative
Out[9]:
word   sentiment

  • AFINN: unigrams scored on a scale from -5 (negative) to 5 (positive)
  • BING: unigrams classified as negative/positive
  • NRC: unigrams assigned to one or more sentiment categories -- anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust
  • LOUGHRAN: unigrams assigned to one of 6 sentiment categories -- constraining, litigious, negative, positive, superfluous, uncertainty

COMMENTS:

  • Domain-specific lexicons can also be used. How are these lexicons validated? Does it make sense to use a social media lexicon to analyze the emotional content of Shakespeare's plays?
  • The most suitable lexicon may change from project to project.
  • As a rule of thumb, we feel that applying sentiment analysis to any text that is still intelligible without a slew of annotations can yield insights.
  • It's easier to identify a clear-cut sentiment in a short text than in a long one.
  • bad is identified as a negative word, not is seen as neutral, but not bad would have to be a positive 2-gram. There are ways to handle such cases, but we will only focus on unigrams in this workshop.
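
The not bad case can be illustrated without any packages (in the tidytext workflow, 2-grams would come from unnest_tokens(word, text, token = "ngrams", n = 2)). The scores below are toy AFINN-style values, for illustration only:

```r
afinn_like <- c(bad = -3, good = 3)  # toy AFINN-style scores (hypothetical)

words <- c("the", "play", "was", "not", "bad")
bigrams <- paste(head(words, -1), tail(words, -1))  # "the play", ..., "not bad"

unigram_score <- sum(afinn_like[words], na.rm = TRUE)  # only "bad" matches: -3

# flip the contribution of any scored word preceded by "not"
negated <- grepl("^not ", bigrams)
bigram_score <- unigram_score -
  2 * sum(afinn_like[words[-1][negated]], na.rm = TRUE)

unigram_score  # -3
bigram_score   #  3
```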


PREPARING THE DATA

We start by creating a custom stop-word list for the works of Shakespeare at the Gutenberg Project.

In [10]:
word = c("etext", "copyright", "implications", "electronic", "version", "william", "shakespeare",  "inc", "gutenberg", "electronic", "machine", "distributed", "commercially", "commercial", "distribution", "download", "shareware")
lexicon = rep("custom",17)
custom = data.frame(word,lexicon)
stop_words_custom_gut = rbind(stop_words,custom)

Now, prepare a tidy dataset for Slick Willy.

In [11]:
library(gutenbergr)
library(dplyr)
library(stringr) # necessary to use str_detect, str_extract

will_shakespeare <-gutenberg_download(c(1112,1524,2264,2242,2267,1120,1128,2243,23042,1526,1107,2253,1121,1103,2240,2268,1535,1126,1539,23046,1106,2251,2250,1790,2246,1114,1108,2262,1109,1537))

tidy_ws <- will_shakespeare %>% 
  unnest_tokens(word,text) %>%
  mutate(word = str_extract(word,"[a-z']+")) %>% # to make sure we're not picking up stray punctuation and odd encodings
  anti_join(stop_words_custom_gut) %>%  # removing the heading business 
  na.omit() # remove NAs
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
Using mirror http://aleph.gutenberg.org
Joining, by = "word"

Now, let's extract the surprise words from A Midsummer Night's Dream (according to the NRC lexicon):

In [12]:
nrc_surprise <- NRC %>% 
  filter(sentiment == "surprise") # to only keep the surprise terms

tidy_ws %>%
  filter(gutenberg_id == 2242) %>% # Gutenberg ID for MND
  inner_join(nrc_surprise) %>%
  count(word, sort = TRUE)
Joining, by = "word"
Out[12]:
word      n
sweet 33
art 16
death 14
pray 13
youth 7
catch 5
hope 4
lose 4
marry 4
spirits 4
fright 3
perchance 3
teach 3
chance 2
mouth 2
slip 2
stealth 2
trip 2
break 1
clown 1
gift 1
illusion 1
laughter 1
lightning 1
merriment 1
palpable 1
precious 1
saint 1
shot 1
shout 1
smile 1
sunny 1
tempest 1
tickle 1

What do you think?

For comparison's sake, let's also look at anger words.

In [13]:
nrc_anger <- NRC %>% 
  filter(sentiment == "anger") # to only keep the anger terms

tidy_ws %>%
  filter(gutenberg_id == 2242) %>% # Gutenberg ID for MND
  inner_join(nrc_anger) %>%
  count(word, sort = TRUE)
Joining, by = "word"
Out[13]:
word      n
rob 16
death 14
hate 9
youth 7
hell 6
words 6
strike 5
derision 4
force 4
lie 4
lose 4
mad 4
adder 3
beast 3
bee 3
fierce 3
honest 3
ill 3
offend 3
wound 3
angry 2
bloody 2
bully 2
confusion 2
curse 2
dame 2
delay 2
discord 2
fight 2
grim 2
⋮
hood 1
hot 1
hunting 1
insufficiency 1
killing 1
lightning 1
lying 1
mighty 1
miserable 1
odious 1
offended 1
poison 1
prison 1
prosecute 1
raging 1
riot 1
scorn 1
shot 1
shout 1
shun 1
sinister 1
stone 1
strife 1
tempest 1
throttle 1
torment 1
warrior 1
whip 1
wrath 1
wretch 1


SENTIMENT ANALYSIS OF MACBETH

Instead of finding words that express a specific sentiment in each play, we are going to compute a score for various sections of Macbeth.

First, let's load the text from Macbeth.

In [14]:
library(tidyr) # to be able to use the spread functionality
library(dplyr) 
library(readr) # to be able to use read_lines

macbeth = read.csv("Data/Macbeth.csv",header=TRUE, sep=",", stringsAsFactors=FALSE)
str(macbeth)
'data.frame':	15221 obs. of  6 variables:
 $ Act       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Scene     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Speaker   : chr  "First Witch" "First Witch" "Second Witch" "Second Witch" ...
 $ Text      : chr  "When shall we three meet again" "In thunder, lightning, or in rain?" "When the hurlyburly's done," "When the battle's lost and won." ...
 $ Scene_Line: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Play_Line : int  1 2 3 4 5 6 7 8 9 10 ...

The Act and Scene variables can be combined to provide an increasing identifier for the play's sections.

Ultimately, we only want to keep information on the text, the line number, and the section.

In [15]:
macbeth$section=macbeth$Act*10+macbeth$Scene
table(macbeth$section)

macbeth <- macbeth %>% select(c("Text","Play_Line","section"))

head(macbeth)
Out[15]:
 11  12  13  14  15  16  17  21  22  23  24  31  32  33  34  35  36  41  42  43 
 13  76 171  65  82  37  94  72  91 180  51 156  62  32 168  36  55 173  94 281 
 51  52  53  54  55  56  57  58 
 74  37  71  27  57  11  35  86 
Out[15]:
Text                               Play_Line section
When shall we three meet again     1         11
In thunder, lightning, or in rain? 2         11
When the hurlyburly's done,        3         11
When the battle's lost and won.    4         11
That will be ere the set of sun.   5         11
Where the place?                   6         11

Now, let's unnest the tokens using word as a basic unit.

In [16]:
library(tidytext)
tidy_macbeth <- macbeth %>%
  unnest_tokens(word, Text)
head(tidy_macbeth,25)
Out[16]:
Play_Linesectionword
11 11 when
1.11 11 shall
1.21 11 we
1.31 11 three
1.41 11 meet
1.51 11 again
22 11 in
2.12 11 thunder
2.22 11 lightning
2.32 11 or
2.42 11 in
2.52 11 rain
33 11 when
3.13 11 the
3.23 11 hurlyburly's
3.33 11 done
44 11 when
4.14 11 the
4.24 11 battle's
4.34 11 lost
4.44 11 and
4.54 11 won
55 11 that
5.15 11 will
5.25 11 be

Then, we get a sentiment score for each word using the Bing lexicon (words that don't appear are considered to be neutral).

In [17]:
library(tidyr)

macbeth_SA <- tidy_macbeth %>%
  inner_join(get_sentiments("bing")) # we will use the bing lexicon to categorize words into negative and positive 

head(macbeth_SA)
dim(macbeth_SA)
Joining, by = "word"
Out[17]:
Play_Line section word sentiment
4 11 lost negative
4 11 won positive
12 11 fair positive
12 11 foul negative
12 11 foul negative
12 11 fair positive
Out[17]:
  1. 1563
  2. 4

Next, we count the positive and negative words in each "section" of the play. The index groups the text into consecutive blocks of $L$ lines.

In [18]:
macbeth_SA <- tidy_macbeth %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = Play_Line %/% 30, sentiment) # here we're using L = 30
             

head(macbeth_SA)
dim(macbeth_SA)
Joining, by = "word"
Out[18]:
index sentiment n
0 negative 11
0 positive 10
1 negative 12
1 positive 10
2 negative 17
2 positive 11
Out[18]:
  1. 160
  2. 3

The counts are stored in the variable $n$. Let's reshape the tibble into a tidy dataset (reminder: each column hosts 1 variable, each row 1 observation).

In [19]:
macbeth_SA <- tidy_macbeth %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = Play_Line %/% 30, sentiment) %>%
  spread(sentiment, n, fill = 0) # we'll get 2 columns, one per sentiment (negative, positive)

head(macbeth_SA)
dim(macbeth_SA)
Joining, by = "word"
Out[19]:
index negative positive
0     11       10
1     12       10
2     17       11
3      8        2
4      8       15
5      9        9
Out[19]:
  1. 80
  2. 3

Finally, let's compute the overall sentiment for each block of lines as the difference between its positive and negative term counts.

In [20]:
macbeth_SA <- tidy_macbeth %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = Play_Line %/% 30, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

head(macbeth_SA)
Joining, by = "word"
Out[20]:
index negative positive sentiment
0     11       10       -1
1     12       10       -2
2     17       11       -6
3      8        2       -6
4      8       15        7
5      9        9        0

And... well, that's it, really. It's fairly easy to plot the outcome.

In [21]:
library(ggplot2)

ggplot(macbeth_SA, aes(index, sentiment)) +
  geom_col(show.legend = TRUE)
Out[21]:

The overall picture seems to be somewhat negative -- but is that surprising? Macbeth is a tragedy, after all, arguably Shakespeare's darkest.

But perhaps what we're seeing is an artifact of the way we have blocked the play, or the length of the blocks, or even of the sentiment lexicon that we've elected to use. Let's look into this a little bit more.

Smaller number of blocks

In [22]:
macbeth_SA <- tidy_macbeth %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = Play_Line %/% 50, sentiment) %>% # use L = 50 instead
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(macbeth_SA, aes(index, sentiment)) +
  geom_col(show.legend = TRUE)
Joining, by = "word"
Out[22]:

Different sectioning mechanism (Act and Scene separation instead of arbitrary number of lines)

In [23]:
macbeth_SA <- tidy_macbeth %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = section, sentiment) %>% # use the Act/Scene section as the index instead
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(macbeth_SA, aes(index, sentiment)) +
  geom_col(show.legend = TRUE)
Joining, by = "word"
Out[23]:

Different lexicons

In [24]:
afinn_macbeth <- tidy_macbeth %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = Play_Line %/% 30) %>% 
  summarise(sentiment = sum(score)) %>% # the AFINN scores are numerical, so we sum instead of count
  mutate(method = "AFINN")

bing_nrc_loughran_macbeth <- bind_rows(
                        tidy_macbeth %>% 
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "BING"),
                        tidy_macbeth %>% 
                            inner_join(get_sentiments("nrc") %>% 
                            filter(sentiment %in% c("positive","negative"))) %>% # because there are other sentiments in NRC
                            mutate(method = "NRC"),
                        tidy_macbeth %>%
                            inner_join(get_sentiments("loughran") %>%
                            filter(sentiment %in% c("positive","negative"))) %>% # because there are other sentiments in LOUGHRAN
                            mutate(method = "LOUGHRAN")) %>%
  count(method, index = Play_Line %/% 30, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

bind_rows(afinn_macbeth, 
          bing_nrc_loughran_macbeth) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Joining, by = "word"
Out[24]:

So... what do you think? Is the evidence conclusive?


We can also look at how often specific words contribute to positive and negative sentiments.

In [25]:
bing_word_counts <- tidy_macbeth %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>% 
  ungroup()

head(bing_word_counts)
tail(bing_word_counts)
Joining, by = "word"
Out[25]:
word  sentiment n
good  positive  52
like  positive  42
fear  negative  35
well  positive  35
great positive  31
death negative  20
Out[25]:
word    sentiment n
wisely  positive  1
woeful  negative  1
wonders positive  1
worn    negative  1
wrath   negative  1
wrongly negative  1

And visualize them (bar charts, word clouds).

In [26]:
# bar charts
bing_word_counts %>%
  group_by(sentiment) %>% # will create 2 graphs
  top_n(10) %>% # pick only the top 10 in each category
  ungroup() %>% # required to avoid a warning message below
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) + # plot a bar chart of word count
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") + # there will be 2 such bar charts, one for each sentiment
  labs(y = "Contribution to sentiment",x = NULL) +
  coord_flip() # horizontal bar charts

# wordcloud
library(wordcloud)

word = c("thou", "thy", "thee", "tis", "hath")
lexicon = rep("custom",5)
custom2 = data.frame(word,lexicon)
stop_words_custom_macbeth = rbind(stop_words,custom2)

tidy_macbeth %>%
  anti_join(stop_words_custom_macbeth) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

# comparison cloud
library(reshape2)

tidy_macbeth %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%  # counting words for the whole play this time
  acast(word ~ sentiment, value.var = "n", fill = 0) %>% # reshaping as a matrix with acast() for comparison cloud
  comparison.cloud(colors = c("#660000", "#000066"),
                   max.words = 100)
Selecting by n
Loading required package: RColorBrewer
Joining, by = "word"
Out[26]:
Attaching package: ‘reshape2’

The following object is masked from ‘package:tidyr’:

    smiths

Joining, by = "word"
Out[26]:
Out[26]:

Nothing jumps out at us as being amiss (which is no guarantee that there's no problem, but it is at least a good sign).


COMMENTS:

  • The choice of lexicon has an effect.
  • The choice of "window" has an effect.
  • It's easy to run a sentiment analysis (a few lines at most). It's hard to pick (or build) the right window and the right lexicon.
  • How could we train our sentiment analysis models?


EXERCISES

  • run a lexicon comparison using Section ID as blocks
  • produce visualizations of frequent words for the other lexicons
  • run a sentiment analysis of Macbeth for other categories of sentiments
  • run a sentiment analysis of Trump's 5 tweets
  • run a sentiment analysis of the field SSS_Recap from the Senators Recap example